William Loving (wfl9zy), James Sweat (jes9hd)
- Explore and visualize the broader Computer Science / Data Analysis industry fields.
- Discover interesting correlations between attributes of available jobs using multiple datasets.
- Learn how to develop meaningful visualizations that communicate our data to an uninformed audience.
Here we explore data scientist jobs in and around the United States.
library(tidyverse)  # read_csv(), the dplyr verbs, and ggplot2
library(plotly)     # ggplotly()
data <- read_csv("../data/data-science-jobs/ds_salaries.csv")
## New names:
## • `` -> `...1`
## Rows: 607 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): experience_level, employment_type, job_title, salary_currency, empl...
## dbl (5): ...1, work_year, salary, salary_in_usd, remote_ratio
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
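The message above suggests quieting the column report, and the `...1` column looks like an unnamed row index carried over from the CSV. A minimal sketch of a quieter import, assuming that index holds no information we need; the toy data frame and temporary file are made up for illustration and stand in for `ds_salaries.csv`:

```r
library(readr)
library(dplyr)

# Hypothetical stand-in for ds_salaries.csv: write.csv() with row names
# produces the same kind of unnamed first column that readr renames to `...1`.
toy <- data.frame(work_year = c(2020, 2021), salary_in_usd = c(79833, 150000))
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = TRUE)

# show_col_types = FALSE quiets the column specification report;
# the "New names" note about `...1` may still be printed.
jobs <- read_csv(path, show_col_types = FALSE) %>%
  select(-1)  # drop the unnamed index column by position

names(jobs)
```

Dropping the column by position avoids having to quote the repaired name `...1`.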
head(data)
## # A tibble: 6 × 12
##    ...1 work_year experience_level employment_type job_title              salary
##   <dbl>     <dbl> <chr>            <chr>           <chr>                   <dbl>
## 1     0      2020 MI               FT              Data Scientist          70000
## 2     1      2020 SE               FT              Machine Learning Scie… 260000
## 3     2      2020 SE               FT              Big Data Engineer       85000
## 4     3      2020 MI               FT              Product Data Analyst    20000
## 5     4      2020 SE               FT              Machine Learning Engi… 150000
## 6     5      2020 EN               FT              Data Analyst            72000
## # ℹ 6 more variables: salary_currency <chr>, salary_in_usd <dbl>,
## #   employee_residence <chr>, remote_ratio <dbl>, company_location <chr>,
## #   company_size <chr>
str(data)
## spc_tbl_ [607 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ...1 : num [1:607] 0 1 2 3 4 5 6 7 8 9 ...
## $ work_year : num [1:607] 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ experience_level : chr [1:607] "MI" "SE" "SE" "MI" ...
## $ employment_type : chr [1:607] "FT" "FT" "FT" "FT" ...
## $ job_title : chr [1:607] "Data Scientist" "Machine Learning Scientist" "Big Data Engineer" "Product Data Analyst" ...
## $ salary : num [1:607] 70000 260000 85000 20000 150000 72000 190000 11000000 135000 125000 ...
## $ salary_currency : chr [1:607] "EUR" "USD" "GBP" "USD" ...
## $ salary_in_usd : num [1:607] 79833 260000 109024 20000 150000 ...
## $ employee_residence: chr [1:607] "DE" "JP" "GB" "HN" ...
## $ remote_ratio : num [1:607] 0 0 50 0 50 100 100 50 100 50 ...
## $ company_location : chr [1:607] "DE" "JP" "GB" "HN" ...
## $ company_size : chr [1:607] "L" "S" "M" "S" ...
## - attr(*, "spec")=
## .. cols(
## .. ...1 = col_double(),
## .. work_year = col_double(),
## .. experience_level = col_character(),
## .. employment_type = col_character(),
## .. job_title = col_character(),
## .. salary = col_double(),
## .. salary_currency = col_character(),
## .. salary_in_usd = col_double(),
## .. employee_residence = col_character(),
## .. remote_ratio = col_double(),
## .. company_location = col_character(),
## .. company_size = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
summary(data)
##       ...1         work_year    experience_level   employment_type
##  Min.   :  0.0   Min.   :2020   Length:607         Length:607
##  1st Qu.:151.5   1st Qu.:2021   Class :character   Class :character
##  Median :303.0   Median :2022   Mode  :character   Mode  :character
##  Mean   :303.0   Mean   :2021
##  3rd Qu.:454.5   3rd Qu.:2022
##  Max.   :606.0   Max.   :2022
##   job_title             salary         salary_currency    salary_in_usd
##  Length:607         Min.   :    4000   Length:607         Min.   :  2859
##  Class :character   1st Qu.:   70000   Class :character   1st Qu.: 62726
##  Mode  :character   Median :  115000   Mode  :character   Median :101570
##                     Mean   :  324000                      Mean   :112298
##                     3rd Qu.:  165000                      3rd Qu.:150000
##                     Max.   :30400000                      Max.   :600000
##  employee_residence  remote_ratio    company_location   company_size
##  Length:607         Min.   :  0.00   Length:607         Length:607
##  Class :character   1st Qu.: 50.00   Class :character   Class :character
##  Mode  :character   Median :100.00   Mode  :character   Mode  :character
##                     Mean   : 70.92
##                     3rd Qu.:100.00
##                     Max.   :100.00
# Replace the dataset's short codes with readable labels.
# Values that match none of the codes are left unchanged.
data_transformed <- data %>%
  mutate(
    # Note: the dataset's documentation appears to define MI as
    # mid-level/intermediate, so "Manager-Level" may be a misnomer.
    experience_level = recode(experience_level,
      "EN" = "Entry-Level",
      "MI" = "Manager-Level",
      "SE" = "Senior-Level",
      "EX" = "Executive-Level"
    ),
    employment_type = recode(employment_type,
      "CT" = "Contract-Work",
      "FT" = "Full-Time",
      "PT" = "Part-Time",
      "FL" = "Freelance"
    ),
    company_size = recode(company_size,
      "L" = "Large",
      "M" = "Medium",
      "S" = "Small"
    ),
    # remote_ratio is numeric, so any other value is converted to text as-is.
    remote_ratio = case_when(
      remote_ratio == 0 ~ "In-Person",
      remote_ratio == 50 ~ "Hybrid",
      remote_ratio == 100 ~ "Remote",
      TRUE ~ as.character(remote_ratio)
    )
  )
head(data_transformed)
## # A tibble: 6 × 12
##    ...1 work_year experience_level employment_type job_title              salary
##   <dbl>     <dbl> <chr>            <chr>           <chr>                   <dbl>
## 1     0      2020 Manager-Level    Full-Time       Data Scientist          70000
## 2     1      2020 Senior-Level     Full-Time       Machine Learning Scie… 260000
## 3     2      2020 Senior-Level     Full-Time       Big Data Engineer       85000
## 4     3      2020 Manager-Level    Full-Time       Product Data Analyst    20000
## 5     4      2020 Senior-Level     Full-Time       Machine Learning Engi… 150000
## 6     5      2020 Entry-Level      Full-Time       Data Analyst            72000
## # ℹ 6 more variables: salary_currency <chr>, salary_in_usd <dbl>,
## #   employee_residence <chr>, remote_ratio <chr>, company_location <chr>,
## #   company_size <chr>
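The relabelled columns are still plain character vectors, so tables and plots will sort them alphabetically unless told otherwise. One option, sketched here on a toy tibble rather than the real data, is to store a label column as a factor with an explicit level order. The values in `toy` are illustrative only.

```r
library(dplyr)

# Toy stand-in for data_transformed (values are illustrative only).
toy <- tibble(
  company_size  = c("Large", "Small", "Medium", "Small"),
  salary_in_usd = c(120000, 60000, 90000, 70000)
)

# Fix the display order once, instead of repeating it in every plot.
toy <- toy %>%
  mutate(company_size = factor(company_size, levels = c("Small", "Medium", "Large")))

levels(toy$company_size)

# Sorting now follows the level order rather than the alphabet.
arrange(toy, company_size)
```

ggplot2 uses the factor's level order for discrete axes, so a factor like this can replace a per-plot `scale_x_discrete(limits = ...)`.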
- This plot shows that salary rises with experience level.
- Different types of work behave differently: for example, contract salaries are much more volatile than full-time salaries.
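# Grouped bars: experience level on the x axis, one bar per employment type.
plot <- ggplot(data_transformed, aes(x = experience_level, y = salary_in_usd, fill = employment_type)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    x = "Experience Level",
    y = "Salary (USD)",
    fill = "Employment Type",
    title = "The Effect of Experience Level on Salary"
  ) +
  # Set the left-to-right order of the experience levels.
  scale_x_discrete(limits = c("Entry-Level", "Senior-Level", "Manager-Level", "Executive-Level")) +
  theme_minimal()
# Convert the static ggplot into an interactive plotly chart.
ggplotly(plot)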
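The plot above uses `stat = "identity"` on data with many rows per group. As far as we can tell, the bars within a group are then drawn on top of each other, so the visible height is in effect the largest salary in that group. A minimal sketch of an alternative, assuming a median per group is the quantity of interest, summarises first and plots one bar per group. The toy tibble and the names `salary_summary` and `median_plot` are illustrative, not part of the original analysis.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for data_transformed (values are illustrative only).
toy <- tibble(
  experience_level = c("Entry-Level", "Entry-Level", "Senior-Level",
                       "Senior-Level", "Senior-Level"),
  employment_type  = c("Full-Time", "Full-Time", "Full-Time",
                       "Full-Time", "Contract-Work"),
  salary_in_usd    = c(50000, 70000, 120000, 160000, 100000)
)

# One row per (experience level, employment type), with the median salary
# and the number of postings behind it.
salary_summary <- toy %>%
  group_by(experience_level, employment_type) %>%
  summarise(median_salary = median(salary_in_usd), n = n(), .groups = "drop")

# geom_col() draws exactly one bar per summarised row.
median_plot <- ggplot(salary_summary,
                      aes(x = experience_level, y = median_salary, fill = employment_type)) +
  geom_col(position = "dodge") +
  labs(x = "Experience Level", y = "Median salary (USD)", fill = "Employment Type") +
  theme_minimal()
```

Keeping `n` in the summary makes it easy to see which bars rest on only a handful of postings.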
- In-person roles paid the most only at medium-sized companies; at large companies, remote roles paid the most.
- At small companies, pay rises stepwise across work arrangements (Hybrid -> In-Person -> Remote).
# Grouped bars: company size on the x axis, one bar per work arrangement.
plot <- ggplot(data_transformed, aes(x = company_size, y = salary_in_usd, fill = remote_ratio)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    x = "Company Size",
    y = "Salary (USD)",
    fill = "Remote Ratio",
    title = "The Effects of Company Size and Remote Ratio on Salary"
  ) +
  # Order the company sizes from smallest to largest.
  scale_x_discrete(limits = c("Small", "Medium", "Large")) +
  theme_minimal()
ggplotly(plot)
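Comparisons between company size and work arrangement are only as reliable as the number of postings behind each combination. A quick check, sketched on toy data (the values and the column name `postings` are illustrative), counts the rows in each cell:

```r
library(dplyr)

# Toy stand-in for data_transformed (values are illustrative only).
toy <- tibble(
  company_size = c("Small", "Small", "Medium", "Large", "Large", "Large"),
  remote_ratio = c("Remote", "Hybrid", "In-Person", "Remote", "Remote", "Hybrid")
)

# Number of postings in each company size / work arrangement cell,
# smallest cells first.
cell_counts <- toy %>%
  count(company_size, remote_ratio, name = "postings") %>%
  arrange(postings)

cell_counts
```

Cells with very few postings are worth treating with caution when reading the chart.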
- There is a lot of information here, but the most interesting finding is that the US has the highest-paying jobs, well ahead of most countries, with small companies in Japan a close second.
# Grouped bars: company location on the x axis, one bar per company size.
plot <- ggplot(data_transformed, aes(x = company_location, y = salary_in_usd, fill = company_size)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    x = "Company Location",
    y = "Salary (USD)",
    fill = "Company Size",
    title = "The Effects of Company Location and Size on Salary"
  ) +
  theme_minimal() +
  # Rotate the country codes so they do not overlap.
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
ggplotly(plot)
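With one cluster of bars per country, this chart is crowded. One way to thin it, sketched on toy data, is to keep only the countries with at least a minimum number of postings before plotting. The threshold `min_postings` and the name `frequent_locations` are hypothetical and would need tuning against the real data.

```r
library(dplyr)

# Toy stand-in for data_transformed (values are illustrative only).
toy <- tibble(
  company_location = c("US", "US", "US", "JP", "DE", "DE"),
  salary_in_usd    = c(150000, 120000, 135000, 260000, 79833, 90000)
)

min_postings <- 2  # hypothetical threshold; choose it to suit the real data

# Keep only the rows whose country has at least min_postings postings.
frequent_locations <- toy %>%
  group_by(company_location) %>%
  filter(n() >= min_postings) %>%
  ungroup()

unique(frequent_locations$company_location)
```

The filtered tibble can be passed to `ggplot()` in place of the full data set. This removes countries represented by a single posting, which can otherwise stand out in the chart.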